Welcome to a tutorial on reshaping data in R using the dplyr and tidyr packages. In this tutorial, you’ll learn how to convert data between wide and long formats, understand when each format is useful, and see how to perform calculations more efficiently with long-format data. We will use examples to illustrate these concepts, focusing on economic data such as GDP components for multiple countries.
Prerequisites
Before diving into the analysis, let’s load the necessary R packages. These packages will help us manipulate data efficiently.
library(dplyr) # For data manipulation
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(tidyr) # For reshaping data
library(here) # For building file paths
here() starts at C:/IMF-R-Book
Understanding Wide and Long Formats
The same dataset can be written in two different formats: wide and long.
Wide Format
In a wide format, values do not repeat in the first column. This format works well if you have a time series of one variable for various countries, as each country’s data for different years can be spread across multiple columns.
Long Format
In a long format, values repeat in the first column. This format is more efficient when you have multiple variables for different countries, as it allows for easier manipulation and analysis. The example dataset below is laid out this way: each year appears once for every country.
year country consumption investment exports imports
1 2016 BRA 5000 1500 1000 800
2 2016 MEX 4000 1200 2000 1000
3 2016 USA 15000 5000 6000 3000
4 2017 BRA 5100 1550 1100 850
5 2017 MEX 4100 1250 2100 1050
6 2017 USA 15500 5100 6200 3100
7 2018 BRA 5200 1600 1200 900
8 2018 MEX 4200 1300 2200 1100
9 2018 USA 16000 5200 6400 3200
10 2019 BRA 5300 1650 1300 950
11 2019 MEX 4300 1350 2300 1150
12 2019 USA 16500 5300 6600 3300
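To make the contrast between the two layouts concrete, here is a minimal sketch that takes only the consumption column from the table above and spreads the years across columns, giving the wide layout described earlier. The object names consumption_long and consumption_wide are illustrative and not part of the original tutorial.

```r
library(dplyr)
library(tidyr)

# Consumption values copied from the table above
# (long layout: one row per year-country pair)
consumption_long <- data.frame(
  year = rep(2016:2019, each = 3),
  country = rep(c("BRA", "MEX", "USA"), times = 4),
  consumption = c(5000, 4000, 15000,
                  5100, 4100, 15500,
                  5200, 4200, 16000,
                  5300, 4300, 16500)
)

# Wide layout: one row per country, one column per year
consumption_wide <- consumption_long %>%
  pivot_wider(names_from = year, values_from = consumption)

print(consumption_wide)
```

In the wide result each country appears exactly once in the first column, while in the long input each year is repeated for every country.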
Reshaping Data
Let’s start by reshaping a dataset from wide to long format. We’ll use the example dataset that contains GDP components (consumption, investment, exports, and imports) for multiple countries.
Wide to Long
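To reshape the wide data into long format with columns year, country, consumption, investment, exports, and imports, we’ll use the pivot_longer() function from the tidyr package. The resulting table also includes a gdp column, computed as consumption + investment + exports - imports.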
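As a minimal sketch of such a pivot_longer() call, assume a hypothetical wide table named gdp_wide with one row per country and one column per component and year, named like consumption_2016. Both the object name and the column naming scheme are assumptions for illustration; the values are the 2016 and 2017 figures from the example dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide layout (an assumption): one row per country,
# one column per component-year combination
gdp_wide <- data.frame(
  country = c("BRA", "MEX", "USA"),
  consumption_2016 = c(5000, 4000, 15000),
  consumption_2017 = c(5100, 4100, 15500),
  investment_2016 = c(1500, 1200, 5000),
  investment_2017 = c(1550, 1250, 5100),
  exports_2016 = c(1000, 2000, 6000),
  exports_2017 = c(1100, 2100, 6200),
  imports_2016 = c(800, 1000, 3000),
  imports_2017 = c(850, 1050, 3100)
)

gdp_long <- gdp_wide %>%
  pivot_longer(
    cols = -country,
    # ".value" keeps the part before "_" as column names;
    # the part after "_" goes into a new year column
    names_to = c(".value", "year"),
    names_sep = "_",
    names_transform = list(year = as.numeric)
  ) %>%
  relocate(year) # move year to the first column

print(gdp_long)
```

Under these assumptions the result has one row per country and year, with columns year, country, consumption, investment, exports, and imports. If the wide data were laid out differently, the names_to and names_sep arguments would need to change accordingly.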
# A tibble: 12 × 7
year country consumption investment exports imports gdp
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2016 BRA 5000 1500 1000 800 6700
2 2017 BRA 5100 1550 1100 850 6900
3 2018 BRA 5200 1600 1200 900 7100
4 2019 BRA 5300 1650 1300 950 7300
5 2016 MEX 4000 1200 2000 1000 6200
6 2017 MEX 4100 1250 2100 1050 6400
7 2018 MEX 4200 1300 2200 1100 6600
8 2019 MEX 4300 1350 2300 1150 6800
9 2016 USA 15000 5000 6000 3000 23000
10 2017 USA 15500 5100 6200 3100 23700
11 2018 USA 16000 5200 6400 3200 24400
12 2019 USA 16500 5300 6600 3300 25100
In long format, computing the gdp column is a single operation on whole columns, which keeps the calculation straightforward and lets it scale to large datasets.
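The calculation can be sketched as follows. The data frame is rebuilt from the example dataset so the snippet runs on its own, and the name gdp_long is illustrative. The formula consumption + investment + exports - imports is the one that matches the gdp values in the table above.

```r
library(dplyr)

# Example dataset, one row per year-country pair
gdp_long <- data.frame(
  year = rep(2016:2019, each = 3),
  country = rep(c("BRA", "MEX", "USA"), times = 4),
  consumption = c(5000, 4000, 15000, 5100, 4100, 15500,
                  5200, 4200, 16000, 5300, 4300, 16500),
  investment = c(1500, 1200, 5000, 1550, 1250, 5100,
                 1600, 1300, 5200, 1650, 1350, 5300),
  exports = c(1000, 2000, 6000, 1100, 2100, 6200,
              1200, 2200, 6400, 1300, 2300, 6600),
  imports = c(800, 1000, 3000, 850, 1050, 3100,
              900, 1100, 3200, 950, 1150, 3300)
)

# One mutate() call computes GDP for every country and year at once
gdp_result <- gdp_long %>%
  mutate(gdp = consumption + investment + exports - imports) %>%
  arrange(country, year)

print(gdp_result)
```

Because every component is its own column, adding more countries or years means adding rows, and the mutate() call stays unchanged.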
Conclusion
In this tutorial, you’ve learned how to reshape data between wide and long formats in R using the dplyr and tidyr packages. We’ve discussed the benefits of each format and demonstrated how long format can simplify complex calculations. These techniques are essential for efficient data manipulation and analysis in R.